ABSTRACT. Financial forecasting is a difficult task due to the intrinsic
complexity of the financial system. A simplified approach to forecasting is
given by "black box" methods like neural networks, which assume little about
the structure of the economy. In the present paper we report our experience
using neural nets as a financial time series forecasting method. In particular
we show that a neural net able to forecast the sign of the price increments
with a success rate slightly above 50 percent can be found. The target series
are the daily closing prices of different assets and indexes during the period
from about January 1990 to February 2000.

KEYWORDS: Forecasting, Neural Networks, Financial Time Series, Detrending
Analysis.



1. Introduction 

Forecasting future values of an asset gives, besides the straightforward profit
opportunities, indications to compute various interesting quantities such as the
price of derivatives (complex financial products) or the probability of an
adverse move, which is the essential information when assessing and managing
the risk associated with a portfolio investment.

Forecasting the price of a certain asset (stock, index, foreign currency, etc.)
on the basis of available historical data corresponds to the well-known problem
in science and engineering of time series prediction. While many time series
may be approximated with a high degree of confidence, financial time series are
among the most difficult to analyze and predict. This is not surprising, since
the dynamics of markets following at least the semi-strong form of the
efficient market hypothesis (EMH) should destroy any easy method to estimate
future activity using past information.

Among the methods developed in Econometrics as well as other disciplines†,
artificial Neural Networks (NN) are being used by "non-orthodox" scientists as
non-parametric regression methods (Campbell, Lo and MacKinlay, 1997; Moody and
Neuneier + Zimmermann, 1998). They constitute an alternative [...] (Campbell,
Lo and MacKinlay, 1997). The advantage of using a neural network as a nonlinear
function approximator is that it appears to be well suited to areas where
mathematical knowledge of the stochastic process underlying the analyzed time
series is lacking and quite difficult to rationalize. Besides, it is important
to note that the lack of linear correlations in the financial price series and
the already accepted evidence of an underlying process different from i.i.d.
noise point to the existence of higher-order correlations or non-linearities.
It is this non-linear correlation that the neural net may eventually catch
during its learning phase. If some macroscopic regularities, arising from the
apparently chaotic behaviour of the large number of components, are present,
then a well trained net could identify and "store" them in its distributed
knowledge representation system made of units and synaptic weights (Moody and
Neuneier + Zimmermann, 1998; Refenes, Burgess and Bentz, 1997).



In the following we will see that, for each of a set of price time series, a
well suited NN showing a "surprising" rate of success in predicting the sign of
the daily price change can be found. No less interesting, we will see that
these regularities in the time series seem to be more present on larger time
scales than in high frequency data, as the performance of the net degrades if
we go from monthly to one-minute data.



2. Multi-layer Perceptron 

Multi-layer perceptrons (MLP) are the neural nets usually referred to as
function approximators. An MLP is a generalization of Rosenblatt's perceptron
(1958): $n_i$ input units, $n_h$ hidden and $n_o$ output units, with all
feed-forward connections between adjacent layers (no intra-layer connections or
loops). Such a net's topology is specified as $n_i$-$n_h$-$n_o$.

A NN may perform various tasks connected to classification problems. Here
we are mainly interested in exploiting what is called the universal
approximation property, that is, the ability to approximate any nonlinear
function to any arbitrary degree of accuracy with a suitable number of hidden
units (White, 1992; Cybenko, 1989).



The approximation is performed by finding the set of weights connecting the
units. This can be done with one of the available non-parametric estimation
techniques, such as nonlinear least squares. In particular we chose error
back propagation, which is probably the most used algorithm to train MLPs
(Rumelhart, Hinton and Williams, 1986). It is basically a gradient descent
algorithm on the error computed on a suitable learning set. A variation of it
uses bias terms and momentum as characteristic



† See the vast bibliography with more than 800 entries at
www.stern.nyu.edu/~aweigend/Time-Series/Biblio/SFIbib.html, reported from
(Weigend and Gershenfeld, 1994).




Figure 1: Each time series is divided into four data sets: learning,
validation, checking and testing (see text for explanation). A difficulty
arises from the fact that the oscillations in the test set are much more
pronounced than in the learning set. The figure shows the daily closing price
of Intel Corp.



parameters. Moreover we fixed the learning rate $\eta = 0.05$, the momentum
$\beta = 0.5$ and used the usual sigmoid as the nonlinear activation function.
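The role of the two parameters can be illustrated with a minimal sketch. It assumes the standard textbook form of gradient descent with momentum and the logistic function as the "usual sigmoidal"; the text does not spell out the exact update rule, and the names `momentum_step` and `sigmoid` are hypothetical.

```python
import numpy as np

ETA = 0.05   # learning rate quoted in the text
BETA = 0.5   # momentum quoted in the text

def sigmoid(x):
    """Logistic sigmoid, assumed here to be the 'usual sigmoidal' activation."""
    return 1.0 / (1.0 + np.exp(-x))

def momentum_step(weights, grad, velocity, eta=ETA, beta=BETA):
    """One gradient-descent update with momentum (standard textbook form).

    velocity_new = beta * velocity - eta * grad
    weights_new  = weights + velocity_new
    """
    velocity_new = beta * velocity - eta * grad
    return weights + velocity_new, velocity_new

# Toy usage: minimize E(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, grad=w, velocity=v)
```

In back propagation the gradient would come from the error on the learning set rather than from this toy quadratic.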



3. Detrending analysis 

We have trained the neural nets on "detrended" time series. The detrending
analysis was performed to mitigate the imbalance between the learning set and
the test set. In fact, subdividing the available data into learning set and
testing set as specified in the following section (see figure 1), we train the
nets on a data set corresponding to a period far back in time, while we test
the nets on a data set corresponding to the most recent period. This problem
is known in the literature as the noise/nonstationarity tradeoff (Moody and
Neuneier + Zimmermann, 1998).

It is known that during the last ten years the American market has changed
noticeably: almost all the stocks connected to information technology have not
only jumped to record values, but their price fluctuations today are also much
stronger than before†. Ignoring this fact would lead to a mistake, because the
net would not learn the characteristics of the "actual situation".



† $P_t$ is what we use to train our nets. Considering $\log(P_t)$ instead of
$P_t$ would mitigate the problem but it would introduce further nonlinearities.

Figure 2: S&P500 detrended time series. The plot shows the original series, the
polynomial fit and the resulting detrended time series, obtained simply as the
difference between the original and the fitting curve. The detrended time
series consists of 2024 points.



To detrend a time series we performed a nonlinear least squares fit using
the Marquardt-Levenberg algorithm (Campbell, Lo and MacKinlay, 1997; Press,
Teukolsky, Vetterling and Flannery, 1994) with a polynomial of sixth degree.
Then we simply computed the difference between the series and the fitting
curve. For each time series considered we ended up with a detrended series
composed of 2024 points corresponding to the period from about January 1990 to
February 2000. For example, the plot in figure 2 shows the detrended time
series of the S&P500 index along with the original series and the polynomial
fit.
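The detrending step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `np.polyfit` (ordinary least squares) is swapped in for the Marquardt-Levenberg routine, which is reasonable only because a polynomial is linear in its coefficients, so both approaches target the same least-squares minimum. The function name and the synthetic series are hypothetical.

```python
import numpy as np

def detrend_polynomial(prices, degree=6):
    """Subtract a least-squares polynomial trend from a price series.

    Returns (detrended, trend), where detrended = prices - trend.
    """
    prices = np.asarray(prices, dtype=float)
    # Rescale the time axis to [-1, 1] to keep the degree-6 fit well conditioned.
    t = np.linspace(-1.0, 1.0, len(prices))
    coeffs = np.polyfit(t, prices, degree)
    trend = np.polyval(coeffs, t)
    return prices - trend, trend

# Synthetic example (not real market data): a cubic trend plus noise.
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 2024)
series = 100 + 40 * t + 25 * t**3 + rng.normal(0.0, 1.0, t.size)
detrended, trend = detrend_polynomial(series)
```

Because the fitted polynomial includes a constant term, the detrended series has zero mean up to rounding error.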

We chose the daily closing prices of 3 indexes and 14 assets traded on the
NYSE and Nasdaq. In particular, the assets were chosen among the most active
companies in the field of information technology.



4. Determining the net topology 

One of the primary goals in training neural networks is to ensure that the
network will perform well on data that it has not been trained on (called
"generalization"). The standard method of ensuring good generalization is to
divide our training data into multiple data sets. The most common data sets are
the learning L, cross validation V, and testing T data sets. While the learning
data set is the data that is actually used to train the network, the usage of
the other two may need some explanation.

Figure 3: A three-layer perceptron 3-7-1 with three input, seven hidden and
one output units.



Like the learning data set, the cross validation data set is also used by the
network during training. Periodically, while training on the learning data set,
the network is tested for performance on the cross validation set. During this
testing, the weights are not trained, but the performance of the network on
the cross validation set is saved and compared to past values. If the network
is starting to overtrain on the training data, the cross validation performance
will begin to degrade. Thus, the cross validation data set is used to determine
when the network has been trained as well as possible without overtraining
(i.e., maximum generalization).

Although the network is not trained with the cross validation set, it uses the 
cross validation set to choose a "best" set of weights. Therefore, it is not truly 
an out-of-sample test of the network. For a true test of the performance of the 
network the testing data set T is used. This data set is used to provide a true 
indication of how the network will perform on new data. 

In figure 3, an example of MLP with $n_i = 3$, $n_h = 7$ and one output unit
takes $P_{t_0}, P_{t_1}, P_{t_2}$ as input and gives the successive value
$P_{t_3}$ as forecast. The number of free parameters is given by the number of
connections between units, $(n_i + n_o) \cdot n_h$.
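A short sketch of how the lagged inputs and the parameter count fit together, assuming a simple sliding window over the series; the helper names `make_windows` and `count_connections` are hypothetical.

```python
import numpy as np

def make_windows(series, n_i):
    """Build (input, target) pairs: n_i consecutive prices predict the next one."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[k:k + n_i] for k in range(len(series) - n_i)])
    y = series[n_i:]
    return X, y

def count_connections(n_i, n_h, n_o=1):
    """Number of weights in a fully connected n_i-n_h-n_o perceptron.

    Input-to-hidden plus hidden-to-output connections:
    n_i * n_h + n_h * n_o = (n_i + n_o) * n_h. Bias terms are not counted.
    """
    return (n_i + n_o) * n_h

X, y = make_windows([1.0, 2.0, 3.0, 4.0, 5.0], n_i=3)
# X rows are [1, 2, 3] and [2, 3, 4]; the targets y are 4 and 5.
n_weights = count_connections(3, 7)  # the 3-7-1 net of figure 3 has 28 weights
```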

While the choice of one output unit comes from the straightforward definition
of the problem, a crucial question is "how many input and hidden units should
we choose?". In general there is no way to determine a priori a good network
topology. It depends critically on the number of training examples and the
complexity of the time series we are trying to learn. To face this problem a
large number of methods are being developed (recurrent networks, model
selection and pruning, sensitivity analysis (Moody and Neuneier + Zimmermann,
1998)), some of which follow the evolutionary paradigm (Evolutionary Strategies
and Genetic Algorithms).

Because we have observed a critical dependence of the performance of the net
on $n_i$ and $n_h$, and to avoid the great complexity of more powerful
strategies (Moody and Neuneier + Zimmermann, 1998), we ended up with the
decision to explore all the possible combinations $n_i$-$n_h$ in a certain
range of values. Our "brute force" procedure consists of training nets of
different topologies (varying $2 \le n_i \le 15$ and $2 \le n_h \le 25$) and
observing their performance. More precisely, we select good nets on the basis
of the mean square error (see eq. (4.1)) computed on 200 out-of-sample points
taken from the test set. Thus, besides the separation of our time series into
Learning-Validation-Testing sets, we further distinguish a subset of the
Testing set: the Checking set $C$ (see fig. 1). The reason is that while we
train the net to interpolate the time series (minimizing the mean square
error), we finally extrapolate to forecast the sign of the increments (to be
defined later).

To assess the efficiency of the learning and to discard badly trained nets
during the search procedure we use the mean square error $\epsilon$ defined as

where $P_t$ is the price value, $G_t$ is the forecasted value at time
$t \in C$ and $\sigma$ is the standard deviation of the time series. For good
forecasts we will have small positive values of $\epsilon$
($1 > \epsilon > 0$).

We set the threshold 0.015 to discriminate good from bad nets. Only those
nets for which $\epsilon < 0.015$ are further tested for sign prediction.
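One form of the check error consistent with the symbols defined above is a mean squared forecast error over $C$ normalized by the variance of the series. The sketch below assumes exactly that normalization (division by $\sigma^2$), which is a guess rather than the paper's stated formula; the function names are hypothetical.

```python
import numpy as np

THRESHOLD = 0.015  # acceptance threshold quoted in the text

def check_error(prices, forecasts, sigma):
    """Mean squared forecast error on the checking set, divided by sigma**2.

    The normalization by the variance is an assumption made for illustration.
    """
    prices = np.asarray(prices, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    return float(np.mean((prices - forecasts) ** 2) / sigma ** 2)

def is_good_net(prices, forecasts, sigma, threshold=THRESHOLD):
    """A net passes the check if its error is below the threshold."""
    return check_error(prices, forecasts, sigma) < threshold
```

With this form, a forecast that is off by one tenth of a standard deviation at every point gives an error of about 0.01 and passes the check.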

In summary, first we learn on set L, and through validation V we find when to 
stop learning; then through check on C we see if the learning process worked well, 
and in case it did, we make predictions in the test phase on set T for "future" 
(i.e. previously unused) price changes and compare them with reality. 
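The four-way split can be sketched as below. The chronological order L, V, C, T follows figure 1, and the sizes follow the relation $|T| = 2024 - (n_i + |L| + |V| + |C|)$ given with Table 2; reserving the first $n_i$ points as the initial input window is an assumption made here to match that relation, and `split_series` is a hypothetical helper.

```python
def split_series(series, n_i, n_learn, n_valid, n_check=200):
    """Chronological split into learning, validation, checking and testing sets.

    The first n_i points are skipped (assumed to seed the first input window),
    so that len(T) = len(series) - (n_i + n_learn + n_valid + n_check).
    """
    a = n_i
    b = a + n_learn
    c = b + n_valid
    d = c + n_check
    if d >= len(series):
        raise ValueError("series too short for the requested split")
    return series[a:b], series[b:c], series[c:d], series[d:]

# Sizes used for the S&P500 net in Table 2: n_i = 8, |L| = 500, |V| = 300.
L, V, C, T = split_series(list(range(2024)), n_i=8, n_learn=500, n_valid=300)
```

For these sizes the test set has 1016 points, in line with the "about 1000 time steps" mentioned for the random walk comparison.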



5. Stopping criteria 

To avoid overfitting and/or very slow convergence of the training phase, the
stopping criterion is determined by the following three conditions, any one of
which is sufficient to end the training phase (early stopping):



1 stopping is assured within 5000 iterations of cross validation (see section
4);

2 during cross validation the mean square error on the validation set $V$ is
computed as $\epsilon_V = \frac{1}{|V|} \sum_{t \in V} (P_t - G_t)^2$; during
training $\epsilon_V$ should decrease, so a stopping condition is given if
$\epsilon_V$ increases again by more than 20% of the minimum value reached up
to then;

3 learning is also stopped if $\epsilon_V$ reaches a plateau; this is tested
during cross validation by averaging 1000 successive values of $\epsilon_V$ and
checking if the current value is above this average.
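The three conditions above can be sketched as a single predicate evaluated after each cross-validation check. The interpretation of the plateau test (comparing the current value with the mean of the preceding 1000 values) is an assumption, and `should_stop` is a hypothetical name.

```python
def should_stop(val_errors, max_checks=5000, rise_tol=0.20, plateau_window=1000):
    """Early-stopping test; val_errors holds one validation error per check so far."""
    n = len(val_errors)
    if n == 0:
        return False
    current = val_errors[-1]
    # Condition 1: hard cap on the number of cross-validation checks.
    if n >= max_checks:
        return True
    # Condition 2: error has risen more than 20% above the minimum reached so far.
    if current > (1.0 + rise_tol) * min(val_errors):
        return True
    # Condition 3: plateau, i.e. current value above the mean of the
    # previous plateau_window values (assumed reading of the text).
    if n > plateau_window:
        window = val_errors[-plateau_window - 1:-1]
        if current > sum(window) / plateau_window:
            return True
    return False
```

A training loop would append the latest validation error to the list and break as soon as the predicate returns true.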



6. Results 

The plot in figure 4 compares the forecasted $G_t$ and the real $P_t$ values
for the time series of Apple Corp. on the test set $T$. It also shows a linear
fit for the points $\{P_t, G_t\}$. A raw measure of performance on the test set
$T$ can be obtained from the slope of the fitting line (let us call it
$\theta$). It will be close to one if the fit corresponds to the $y = x$ line,
i.e., if $P_t \approx G_t$. We obtained the following $\theta$'s for the time
series in tables 2 and 3: $\theta_{S\&P500} = 0.906$, $\theta_{DJI} = 0.874$,
$\theta_{Nasdaq100} = 0.860$, $\theta_{AAPL} = 0.976$, $\theta_{T} = 0.921$,
$\theta_{AMD} = 0.914$, $\theta_{STM} = 0.885$, $\theta_{HON} = 0.885$,
$\theta_{INTC} = 0.874$, $\theta_{CSCO} = 0.860$, $\theta_{WCOM} = 0.847$,
$\theta_{IBM} = 0.842$, $\theta_{ORCL} = 0.824$, $\theta_{MSFT} = 0.803$,
$\theta_{SUNW} = 0.774$, $\theta_{DELL} = 0.692$, $\theta_{QCOM} = 0.488$.




Figure 4: Forecast of the time series AAPL. Price is expressed in US$. A
perfect forecast would be represented by dots on the $y = x$ line (shown as the
continuous line). The dashed line is a linear fit of the points
$\{P_t, G_t\}$. A raw measure of the error in forecasting is given by the
angular coefficient of the fitting line. Values close to one indicate
$G_t \approx P_t$.



The final estimation of the forecasting performance is made by means of the
one-step sign prediction rate $\zeta$, defined on $T$ as follows

$$\zeta = \frac{1}{|T|} \sum_{t \in T} \left[ HS(\Delta P_t \cdot \Delta G_t) + 1 - HS(|\Delta P_t| + |\Delta G_t|) \right] \qquad (6.1)$$

where $\Delta P_t = P_t - P_{t-1}$ is the price change at time step $t \in T$
and $\Delta G_t = G_t - P_{t-1}$ is the guessed price change at the same time
step. Note that we assume to know the value of $P_{t-1}$ to evaluate
$\Delta G_t$. $HS$ is a modified† Heaviside function, $HS(x) = 1$ for $x > 0$
and 0 otherwise. The argument of the summation in eq. (6.1) gives one only if
$\Delta P_t$ and $\Delta G_t$ are non-zero and have the same sign, or if
$\Delta P_t$ and $\Delta G_t$ are both zero. In other words $\zeta$ is the
probability of a correct guess on the sign of the price increment, estimated
on $T$.

† The usual HS function gives 1 in zero, i.e., HS(0) = 1.
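The sign prediction rate described above (same sign and non-zero, or both zero) can be sketched directly. The array layout is an assumption: `prices[t]` and `forecasts[t]` refer to the same time step, and the first forecast is unused because it has no previous price to compare with.

```python
import numpy as np

def heaviside_modified(x):
    """Modified Heaviside: 1 for x > 0 and 0 otherwise, so HS(0) = 0."""
    return (np.asarray(x) > 0).astype(float)

def sign_prediction_rate(prices, forecasts):
    """Fraction of steps where the forecast gets the sign of the increment right."""
    prices = np.asarray(prices, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    dP = prices[1:] - prices[:-1]      # real price change
    dG = forecasts[1:] - prices[:-1]   # guessed change, given the previous price
    hits = (heaviside_modified(dP * dG)
            + 1.0 - heaviside_modified(np.abs(dP) + np.abs(dG)))
    return float(hits.mean())
```

The first term counts matching non-zero signs; the second contributes only when both increments are exactly zero.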

In the lower-right inset of figure 4, $P_t - G_t$ is shown as a function of
$P_t$. One can see that the difference between the real and the forecasted
values clusters at small $P_t$. Another way to see it is to look at the
histogram of $\zeta$ as a function of $\Delta P_t$, in other words the rate of
correct guesses on the sign of the price increment relative to the magnitude of
the fluctuation of the real price. To obtain an unbiased histogram we have to
normalize it, dividing each bin by the corresponding value of the histogram of
$\Delta P$ (the distribution of $\Delta P$ follows a power law, so that large
fluctuations are much less probable). The resulting distribution is plotted in
figure 5. It is now clearly visible that the net does not favor large
increments over small ones or vice versa. In fact the probability of making a
correct guess on the sign of the increment seems independent of the magnitude
of the increment itself. This does not mean that the net forecasts "rare
events" (i.e., a profit opportunity) as easily as normal fluctuations, because
the statistics calculated here are not significant with respect to extreme
events.

Figure 5: Normalized $\zeta$ as a function of $\Delta P$ (arbitrary units). The
sign prediction rate seems independent of the magnitude of the price change
$|\Delta P|$.

To interpret the results that we are going to show, we have to concentrate on
the way we select a good net to make forecasts. For each time series we
performed a search to determine the topology of a good net, as specified in
section 4. Once we get a pool of candidates the question is "how many of them
give a sign prediction rate above fifty percent?"

This question is answered in table 1. There, tot indicates the number of nets
such that $\epsilon < 0.015$, that is, the nets we judged as good, while ok is
the number of them that gave $\zeta > 50\%$. The ratio ok/tot can be seen as an
estimate of the confidence that the net will give a "sufficient" forecast of
the price change, where sufficient means above fifty percent.

The value of $\zeta$, together with the number of units per layer of the best
net, is reported in table 2 and table 3 along with the dimension of the
learning and validation sets.

Table 1: Here tot indicates the number of nets such that $\epsilon < 0.015$,
that is, the nets we judged as good, while ok is the number of them that gave a
sign prediction rate $\zeta$ above 50 percent.

Series          ok/tot      Series          ok/tot
S&P500          32/54       DowJones Ind    189/450
Nasdaq 100      45/86

SUNW            112/112     DELL            69/69
WCOM            76/76       AAPL            309/311
INTC            46/46       AMD             244/245
STM             33/269      ORCL            35/35
MSFT            21/21       IBM             9/9
CSCO            39/48       HON             22/82
T               6/6         QCOM            43/436

Table 2: For each index the net topology $n_i$-$n_h$-1 is specified along with
$\epsilon$, $\zeta$, $|L|$ and $|V|$. $|T| = 2024 - (n_i + |L| + |V| + |C|)$
and $|C| = 200$.

Symbol          $n_i$   $n_h$   $|L|$   $|V|$   $\epsilon$   $\zeta$ (%)
S&P500          8       2       500     300     0.008938     52.272727
DowJones Ind    13      2       700     200     0.012074     51.488423
Nasdaq 100      4       25      700     200     0.014182     50.982533

The sign prediction rates range from 50.29% to 54%. While the smallest value,
50.29%, may be questionable, the largest values, around 54%, seem a clear
indication that the net is not behaving randomly. Instead it has captured some
regularities in the nonlinearities of the series.

A quite direct test for randomness can be done by computing the probability
that such a forecast rate can be obtained just by flipping a coin to decide the
next price increment. For this purpose we use a random walk (pr(up) = pr(down)
= 1/2) as forecasting strategy $G^{rw}_t$ and observe how many, over 1000
different random walks, give a sign prediction rate $\zeta_{rw}$, defined in
eq. (6.1), above the value obtained with our net. Note that each random walk
performs about 1000 time steps, the same as $|T|$ for the specified time series
(see tables 2 and 3). These values are reported in table 4. They indicate that,
except for QCOM, the random walk assumption cannot give the same prediction
rate as the neural net†.
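The coin-flip comparison can be sketched as follows. This is a minimal illustration assuming that each random forecaster guesses "up" or "down" with probability 1/2 at every step and is scored with the same sign rule; the function name and the seed are hypothetical.

```python
import numpy as np

def count_random_walks_beating(prices, zeta_net, n_walks=1000, seed=0):
    """Count coin-flip forecasters whose sign prediction rate is >= zeta_net."""
    rng = np.random.default_rng(seed)
    dP = np.diff(np.asarray(prices, dtype=float))
    count = 0
    for _ in range(n_walks):
        guess = rng.choice([-1.0, 1.0], size=dP.size)  # pr(up) = pr(down) = 1/2
        # A coin flip is never zero, so a guess is correct only if signs match.
        zeta_rw = np.mean(dP * guess > 0)
        if zeta_rw >= zeta_net:
            count += 1
    return count
```

For a test set of about 1000 steps, only a handful of the 1000 random forecasters are expected to reach a rate of 0.54, which is the order of magnitude reported for the best nets in table 4.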



† In other words, given a neural net which produces $\zeta$ as prediction rate
over a certain time series $P_t$, we may compute the probability at which the
null hypothesis of randomness

Table 3: Success ratio for the prediction of the sign change. For each asset
the net topology is specified along with $\epsilon$, $\zeta$ and the number of
points in the learning and validation sets. The marker in the first column
indicates the stock exchange of each symbol: NYSE (o) or Nasdaq (•).

    Company            Symbol   $n_i$   $n_h$   $|L|$   $|V|$   $\epsilon$   $\zeta$ (%)
•   Sun Microsys       SUNW     9       7       500     300     0.014435     54.005935
•   Dell Computer      DELL     4       18      500     300     0.004315     53.543307
•   Mci Worldcom       WCOM     3       2       500     300     0.004024     53.392330
•   Apple Comp Inc     AAPL     5       17      700     300     0.013786     53.374233
•   Intel Corp         INTC     6       6       500     300     0.009953     53.254438
o   Adv Micro Device   AMD      4       23      500     300     0.012339     52.952756
o   ST Microelectron   STM      6       2       500     300     0.003978     52.465483
•   Oracle Corp        ORCL     6       2       500     300     0.006333     52.366864
•   Microsoft Cp       MSFT     10      4       500     300     0.008327     52.277228
o   Intl Bus Machine   IBM      10      6       500     300     0.006642     52.079208
•   Cisco Systems      CSCO     4       14      500     300     0.008364     51.968504
o   Honeywell Intl     HON      8       2       600     200     0.008506     51.877470
o   AT&T               T        3       22      500     300     0.014920     51.327434
•   Qualcomm Inc       QCOM     4       25      500     300     0.009888     50.295276



Table 4: For every sign prediction rate $\zeta$ reported in tables 2 and 3, the
number of random walks (over 1000) that achieved a sign prediction rate
$\zeta_{rw}$ greater than or equal to $\zeta$ is shown here.

Series          #{$\zeta_{rw} \ge \zeta$}     Series          #{$\zeta_{rw} \ge \zeta$}
S&P500          78                            Dow Jones Ind   186
Nasdaq 100      258

SUNW            7                             DELL            16
WCOM            13                            AAPL            25
INTC            21                            AMD             30
STM             50                            ORCL            69
MSFT            76                            IBM             103
CSCO            98                            HON             108
T               194                           QCOM            431



7. Weekly and intra-day data

It is interesting to ask if the MLP may exploit regularities in the time
series of prices sampled at a lower or higher rate than daily. Apart from the
"scaling behaviour" observed empirically in real price series, we are
interested in the performance of our procedure (search plus learn) when we
change the time scale on which we sample the price of the assets or the index
at a stock market.

To answer this question we performed the same search for a good net on the
IBM and AMD stock prices sampled on a weekly basis, as well as on intra-day
data with a frequency of one minute. Both series consisted of 2024 points, the
same as the daily price series.

The outcome is that intra-day data are much more difficult to forecast with
our MLPs. In fact, for both one-minute data series the search did not succeed
in finding a satisfactory net; the few nets that passed the check on
$\epsilon$ all gave a sign prediction rate $\zeta < 40\%$.

On the other hand the forecast of weekly data gave a success rate comparable
with that of the daily series (e.g., a 4-2-1 net achieved
$\zeta = 51.422764\%$ with $\epsilon = 0.004947$).



8. Artificially generated price series 

As a last question, and to further test the correctness of our prediction, we
tried to forecast the sign of price changes of an artificially generated time
series. This was generated by the Cont-Bouchaud herding model, which seems to
be one of the simplest able to show fat tails in the histogram of returns (Cont
and Bouchaud, 1999). This model shows the relation between the excess kurtosis
observed in the distribution of returns and the tendency of market participants
to imitate each other (what is called herd behaviour). The model consists of
percolating clusters of agents (Stauffer, 2000). At a given time step a certain
number of coalitions (clusters) decide what to do: they buy with probability
$a$, sell with probability $a$ or stay inactive with probability $1 - 2a$. The
demand of a certain group of traders is proportional to its size, and the total
price change is proportional to the difference between supply and demand.
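The buy/sell/inactive rule can be illustrated with a deliberately simplified sketch. It is not the Cont-Bouchaud model itself: the cluster sizes are passed in as a fixed list instead of being generated by percolation, and the proportionality constant `scale` is a free parameter introduced here for illustration.

```python
import numpy as np

def price_changes(cluster_sizes, a, n_steps, scale=1.0, seed=0):
    """Price changes from clusters that buy, sell or stay inactive each step."""
    if not 0.0 <= a <= 0.5:
        raise ValueError("a must lie in [0, 0.5]")
    rng = np.random.default_rng(seed)
    sizes = np.asarray(cluster_sizes, dtype=float)
    # +1 = buy (prob. a), -1 = sell (prob. a), 0 = inactive (prob. 1 - 2a)
    decisions = rng.choice([1.0, -1.0, 0.0],
                           size=(n_steps, sizes.size),
                           p=[a, a, 1.0 - 2.0 * a])
    # Each cluster's demand is proportional to its size; the price change is
    # proportional to the net demand summed over clusters.
    return scale * (decisions @ sizes)

dp = price_changes([1, 2, 5], a=0.1, n_steps=1000)
```

Since each decision is drawn independently at every step, the sign of the next price change carries no memory of the past in this sketch.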

It is clear that such a model generates unpredictable time series, and our
networks should not be able to make any predictions. Indeed, when our method
was applied to this series it did not succeed in finding a good net, as all the
nets tried performed badly on the check set $C$, i.e., $\epsilon > 0.015$.

† (continued) is rejected. We use a random walk (pr(up) = pr(down) = 1/2) as
forecasting strategy $G^{rw}_t$ and then compute $\zeta_{rw}$, defined in eq.
(6.1), on the time series $P_t$. The random variable $\zeta_{rw}$ has mean 0.5
and standard deviation $\sigma_{rw}$. By definition $\zeta_{rw}$ is the sample
mean of $|T|$ i.i.d. Bernoullian random variables. Thus, assuming that
$\zeta_{rw}$ converges to a Gaussian $N(1/2, \sigma_{rw})$, we can estimate the
unknown variance of $\zeta_{rw}$ as
$\hat{\sigma}^2_{rw} = \frac{1}{N} \sum_{i=1}^{N} (\zeta_{rw,i} - 1/2)^2$. To
have an estimation of $\sigma_{rw}$ we ran $N = 1000$ random walks, each giving
a value for $\zeta_{rw}$. Once we estimate $\sigma_{rw}$, the null hypothesis
becomes "what is the probability $P_{\zeta_{rw}}[x > \zeta]$ that the neural
net is doing a random prediction on $P_t$ with rate $\zeta$?" or, the other way
around, "what is the probability $P_{\zeta_{rw}}[x < \zeta]$ that the net is
not predicting randomly?". Formally
$P_{\zeta_{rw}}[x < \zeta] = \int_{-\infty}^{\zeta} N(1/2, \hat{\sigma}_{rw})(x)\,dx$,
where $N(1/2, \hat{\sigma}_{rw})$ is a Gaussian and $\hat{\sigma}_{rw}$ is the
estimation of the standard deviation $\sigma_{rw}$ of the random variable
$\zeta_{rw}$. In summary, for every sign prediction rate $\zeta$ obtained with
our neural net on a time series $P_t$, we first estimate $\sigma_{rw}$ as
specified above, then we compute the probability $P_{\zeta_{rw}}[x < \zeta]$ at
which the null hypothesis of random prediction is rejected. The results tell us
that for some bad prediction values (like for QCOM or Nasdaq 100) the
randomness hypothesis cannot be rejected, but for the majority of the series
the probability to reject the null hypothesis is something between 0.01 and
0.1.
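The Gaussian test of the randomness hypothesis described in the footnote can be sketched as below. It assumes that every coin-flip guess is correct with probability 1/2 (no zero price increments), so that $\zeta_{rw}$ is a binomial proportion; the function names and the seed are hypothetical.

```python
import math
import numpy as np

def estimate_sigma_rw(n_steps, n_walks=1000, seed=0):
    """Estimate the standard deviation of zeta_rw from n_walks coin-flip forecasters."""
    rng = np.random.default_rng(seed)
    # zeta_rw is the mean of n_steps Bernoulli(1/2) outcomes.
    zeta_rw = rng.binomial(n_steps, 0.5, size=n_walks) / n_steps
    # Variance estimated around the known mean 1/2, as in the footnote.
    return math.sqrt(np.mean((zeta_rw - 0.5) ** 2))

def prob_below(zeta, sigma_rw):
    """P[x < zeta] for x ~ N(1/2, sigma_rw), computed with the error function."""
    return 0.5 * (1.0 + math.erf((zeta - 0.5) / (sigma_rw * math.sqrt(2.0))))

sigma = estimate_sigma_rw(n_steps=1000)
p = prob_below(0.54, sigma)
```

For 1000 steps the estimate should be close to the theoretical value $0.5/\sqrt{1000} \approx 0.0158$, so a rate of 0.54 lies roughly 2.5 standard deviations above the mean.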



9. Discussion 

We have shown that a suitable neural net able to forecast the sign of the price
increments with a success rate slightly above 50 percent on a daily basis can
be found. This can be an empirical demonstration that a good net exists, but we
do not have a mechanism to find it with "high probability". In other words we
cannot use this method as a profit opportunity because we do not know a priori
which net to use. Perhaps a better algorithm to search for a good topology
(model selection and pruning with sensitivity analysis (Moody and Neuneier +
Zimmermann, 1998)) would give some help. Future work will likely take this
direction.

As a final remark, we have found that intra-day data are much more difficult to
forecast with our method than daily or weekly data.